Appropriate for summarizing a set of numbers (continuous variables)
Choose a bin size and a center value, e.g. one-hour bins centered at the integers would be denoted as \((5.5, 6.5]\), \((6.5, 7.5]\), \((7.5, 8.5]\), etc. Bins are non-overlapping. Create enough bins to completely cover the data
Assign each runner to a bin, e.g. 13.50 goes into the \((12.5, 13.5]\) bin and 13.51 goes into the \((13.5, 14.5]\) bin
Plot bars for each bin, with the height of the bar corresponding to the number of runners in that bin
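The binning steps above can be sketched in base R. This is a minimal illustration, not the code behind the slides: the `times` vector is made-up toy data, and `cut()` with `right = TRUE` is one way to get the left-open, right-closed intervals described above.

```r
# Toy personal-best times in hours (made up for illustration, not from the dataset)
times <- c(5.9, 6.5, 6.51, 7.2, 13.50, 13.51)

# One-hour bins centered at the integers, with enough edges to cover the data
breaks <- seq(floor(min(times)) - 0.5, ceiling(max(times)) + 0.5, by = 1)

# right = TRUE gives left-open, right-closed intervals such as (12.5, 13.5]
bins <- cut(times, breaks = breaks, right = TRUE)

# Bar heights of the histogram: number of observations in each bin
counts <- table(bins)
print(counts)
```

In ggplot2, the same choices are controlled by the `binwidth`, `center` (or `boundary`), and `closed` arguments of `geom_histogram()`.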
Random vertical jitter makes it possible to see multiple data points with the same value
Violin plots
A more recent introduction is the ‘violin plot’: a density plot and its reflection
Easy to create and to show for multiple variables at once, like boxplots, but gives a more accurate representation of the distribution, like a density plot
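A single violin can be drawn with `geom_violin()`. The sketch below uses simulated data as a hypothetical stand-in for the running times, so the `toy` data frame and its values are assumptions; only the column name `pb100k_dec` is taken from the example in these slides.

```r
library(ggplot2)

# Simulated personal-best times in hours (hypothetical stand-in for the real data)
set.seed(1)
toy <- data.frame(pb100k_dec = rnorm(200, mean = 11, sd = 2))

# One violin: the density estimate of pb100k_dec mirrored around a central axis
p <- ggplot(toy) +
  geom_violin(aes(x = "", y = pb100k_dec)) +
  scale_y_continuous(name = "Personal best time (hours)") +
  labs(x = NULL)
p
```

Mapping a grouping variable to `x` instead of the empty string would draw one violin per group, which is what makes violins as convenient as boxplots for side-by-side comparison.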
Adding multiple geometric objects to one ggplot object results in multiple layered views of the information.
Example of jittered points over boxplot:
ggplot(ultrarunning) +
  geom_boxplot(aes(x = pb100k_dec), outlier.shape = NA) +  # (1)
  geom_jitter(aes(x = pb100k_dec, y = 0), width = 0, height = 0.33) +  # (2)
  scale_x_continuous(name = "Personal best time (hours)") +
  scale_y_continuous(breaks = NULL, name = NULL, limits = c(-1, 1)) +
  theme(text = element_text(size = 24))
(1) First create the boxplot
(2) Then add the points along the \(y=0\) line with random vertical jitter
Other examples of layering we’ve seen so far: layering density over histogram of running times (slide 15); data points over a boxplot (slide 24); boxplot over violin plot (slide 27)
When to layer?
Some situations when you would want to layer:
When a single layer has a critical weakness or deficiency (Avoid distorting what the data say)
When you want to highlight both granular and aggregate components of the data (Reveal the data at several levels of detail, from broad overview to fine structure)
To anchor your data (layer A) within the context of reference or other data (layer B) (Encourage comparison between different pieces of data)
Medium Spice: Make the layered violin plots on slide 27 or slide 28;
Yoga Flame: Make the layered Density+Histogram on slide 15 (hint: use after_stat to get the correct y-axis for the histogram); or the layered boxplot on slide 25 (hint: use geom_jitter instead of geom_point); or one of the barcharts on slides 35, 36, or 37 (hint: you will need to use case_when to create some character variables before making the plots);
Dim Mak: Make one of the bivariate violin plots on slides 40, 41, or 42;
References
Samtleben, E., 2023. Ultrarunning dataset. Teaching of Statistics in the Health Sciences Resource Portal, Available at https://www.causeweb.org/tshs/ultra-running/.